Abstract. Universal screening in elementary schools often includes administering
curriculum-based measurement in reading (CBM-R); but in first grade, nonsense 
word fluency (NWF) and, to a lesser extent, word identification fluency (WIF) are 
used because of concerns that CBM-R is too difficult for emerging readers. This 
study used Kane’s argument-based approach to validation as a framework to 
evaluate the interpretations and use of scores resulting from screening 257 first- 
and second-grade students. First, scores from three word lists (decodable WIF, 
high-frequency WIF, and whole-word NWF) were examined as indicators of 
reading achievement. Then, the use of these word list scores was evaluated 
regarding their ability to classify at-risk readers accurately and as supplements to 
CBM-R during the winter universal screening period. Participants were also
concurrently administered a norm-referenced measure of early reading skills and 
global reading achievement. Results suggested that the word lists were good 
indicators of reading achievement and provided support for using CBM-R or a 
word list in conjunction with CBM-R to discriminate among at-risk readers. 
Findings have implications for the administration of universal screeners in first 
and second grade. 


Universal screening, a core component of a Multi-Tiered System of Supports
framework, is used for early identification of students who may be at risk for
learning disabilities. Resultant data are used to inform early intervention,
which is an effective approach to prevent reading difficulties (Vellutino,
Scanlon, Small, & Fanuele, 2006). Curriculum-based measurement in reading
(CBM-R) and nonsense word fluency (NWF) are often used for universal screening
in the early elementary grades (Deno et al., 2009). Word identification fluency
(WIF), albeit less frequently used, is another universal screening measure
available to schools. Although there are clear benefits to administering CBM-R,
NWF, and WIF, there are limitations associated with the use of NWF and WIF, and
concerns about the ability of NWF scores to classify at-risk early readers
accurately (Clemens, Shapiro, & Thoemmes, 2011).

CURRICULUM-BASED 
MEASUREMENT IN READING 

CBM-R is a task in which students read aloud from grade-level text as the
examiner listens and records their performance to estimate oral reading rate,
which is typically reported in the metric of words read correctly per minute
(WRCM). One benefit of administering CBM-R is that as a general outcome
measure, it indexes global reading performance across the academic year,
instead of measuring the specific, hierarchically organized subskills of
reading (Fuchs & Deno, 1991). Although many published studies indicate that
CBM-R is useful for universal screening (January & Ardoin, 2015; Kilgus, Methe,
Maggin, & Tomasula, 2014; Reschly, Busch, Betts, Deno, & Long, 2009), the
procedure requires students to integrate the many components of skilled reading
required to read connected text (Fuchs, Fuchs, Hosp, & Jenkins, 2001),
including decoding and word identification. However, many students in the early
elementary grades are not yet prepared to read connected text, so the task may
be too difficult and may result in poor classification accuracy for emerging
readers who are at risk for developing reading disabilities (Catts, Petscher,
Schatschneider, Bridges, & Mendoza, 2009; Hosp, Hosp, & Dole, 2011).
Potentially for this reason, publishers of curriculum-based measurement (CBM)
probes recommend that CBM-R be administered for universal screening no earlier
than the winter of first grade and that, even then, NWF be administered in
conjunction with CBM-R for the remainder of the year (Good & Kaminski, 2007;
Pearson, 2012).

NONSENSE WORD FLUENCY 

In contrast to CBM-R, NWF is a subskill mastery measure that combines sound
identification and blending of vowel-consonant (VC) and
consonant-vowel-consonant (CVC) pseudowords to measure students’ letter-sound
correspondence, decoding skills, and progress as emerging readers (Good, Baker,
& Peyton, 2009). Evidence indicates NWF scores account for a large portion of
the variance in word reading and pseudoword decoding (Burke & Hagan-Burke,
2007; Oslund et al., 2012). Research also has demonstrated that NWF scores have
moderate to strong concurrent and predictive associations with CBM-R
performance (Burke & Hagan-Burke, 2007; Cummings, Dewey, Latimer, & Good, 2011;
Harn, Stoolmiller, & Chard, 2008) and reading achievement (Fien et al., 2008,
2010).

Given that NWF is a decoding task, students can use different approaches to
correctly decode each word. That is, students are able to say the individual
sounds in each word, partially blend the word, or say the word as a unit. Thus,
NWF has the potential of providing more information about students’ decoding
skills and potential risk for reading problems than other CBM measures. For
instance, students who decode pseudowords as units (as opposed to sound by
sound or partial blending with or without recoding) score higher on NWF probes
and subsequent measures of oral reading (Harn et al., 2008). Furthermore,
students who blend nonsense words as units generally have better phonemic
skills and greater automaticity than students who decode the individual letter
sounds or use a combination of strategies (Cummings et al., 2011; Harn et al.,
2008).

Despite the potential of gaining more descriptive information about students’
decoding skills, a limitation is introduced when students use different
decoding strategies. More specifically, variability in the strategies used
results in a lack of consistency in the skill or construct measured within and
across NWF assessments (Ritchey, 2008), which may affect its relation to
measures of reading achievement (Harn et al., 2008). Therefore, when students
are allowed to choose their decoding strategy, educators cannot be certain
which decoding skill (e.g., unitization, letter-sound correspondence) is
measured by the NWF probes they administer.

Another limitation of existing NWF research is that it has almost exclusively
examined the utility of NWF measures for assessing kindergarten and first-grade
students’ skills, even though decoding skills continue to be an important
element of reading instruction beyond these grade levels, particularly for
struggling readers. As such, the potential benefit of using NWF scores to
differentiate at-risk second-grade students has not been examined empirically.
Extant research indicates that for students in kindergarten, scores from NWF
adequately discriminate between those who do and those who do not later meet
oral reading benchmarks (Clemens, Hilt-Panahon, Shapiro, & Yoon, 2012), but for
first-grade students, NWF scores fail to predict which students later
underachieve in reading (Clemens et al., 2011; Vanderwood, Linklater, & Healy,
2008). The inability of NWF performance to discriminate among poor readers in
first grade and the lack of NWF research in later grades may be due to existing
NWF probes assessing a narrow set of skills (i.e., decoding VC and CVC
pseudowords). It is possible that NWF probes that include more complex word
types, such as consonant-vowel-consonant-e (e.g., vate), have the potential to
provide educators with information about students’ advanced phonics and
decoding skills, as well as better discriminate among at-risk readers.

WORD IDENTIFICATION FLUENCY 

WIF is yet another alternative for universal screening. WIF probes require that
students read a list of high-frequency and/or decodable words in 1 min,
directly measuring students’ accuracy and speed of real word reading (Fuchs,
Fuchs, & Compton, 2004). Existing research suggests moderate to strong
associations between WIF scores and performance on norm-referenced measures of
word identification and decoding, passage reading, and reading achievement
(Clemens et al., 2011; Fuchs et al., 2004; Zumeta, Compton, & Fuchs, 2012).
Additionally, research by Clemens et al. (2011) and Fuchs et al. (2004)
suggested that in the fall of first grade, WIF, as compared with NWF, better
predicts later reading achievement and is a better indicator of risk for
reading difficulties. Furthermore, although WIF was the single most accurate
early literacy measure for identifying first-grade students at risk for reading
problems, adding one or two additional early reading measures, such as NWF or
phoneme segmentation fluency, provided a more accurate screening battery that
identified first-grade students at risk for reading failure (Clemens et al.,
2011). Both Clemens et al. (2011) and Fuchs et al. (2004) used
investigator-developed WIF probes consisting of words sampled from popular
high-frequency word lists (e.g., Dolch Word List) made up of both decodable and
nondecodable words. Thus, the extent to which the WIF probes used in those
studies measured students’ decoding skills likely varied (Ritchey, 2008).

Despite WIF demonstrating superiority over other first-grade CBM measures such
as NWF, there are disadvantages that preclude its widespread use. First, unlike
NWF, structured, reliable, and valid WIF probes are not available from most
publishers of CBM probes. Therefore, educators must resort to developing their
own measures of high-frequency word reading or simply using generic lists that
are available (e.g., from interventioncentral.org). Although educator-developed
high-frequency word lists may provide valuable information, they lack
structure, are not validated as indicators of reading achievement, lack
adequate norms to compare student performance for benchmarking, and do not have
equivalent forms for progress monitoring. Furthermore, if structured and
validated WIF probes were developed and made widely available, educators could
have greater confidence when using them to make decisions about which students
are not meeting reading benchmarks. Unfortunately, to date, researchers have
not examined (a) whether there is any added benefit of concurrently
administering WIF or NWF probes with CBM-R or (b) whether there is any benefit
to administering WIF probes that are composed solely of decodable words.

AN ARGUMENT-BASED APPROACH 
TO VALIDATION 

Kane’s (2013a, 2013b) argument-based approach to validation is a practical
framework for evaluating the decisions that are made based on observed (test)
scores, including results from universal screenings. This framework posits that
the interpretations and uses of observed scores must be explicitly stated
(referred to as the interpretation/use argument [IUA]) and then evaluated
systematically (validation). When the IUA and its assumptions are sufficiently
supported by evidence, the uses and interpretations of test scores can be
regarded as valid. The IUA includes a set of three hierarchically organized
inferences that should be examined empirically: scoring, generalization, and
extrapolation. Scoring inferences are based on the process by which an observed
performance (e.g., a student reading connected text aloud) is transformed into
an observed score (e.g., WRCM) through scoring rules (e.g., a word that is
misread counts as an error). Evidence (i.e., validation) that scoring rules are
applied appropriately includes adequate interscorer agreement/interrater
reliability. The generalization inference refers to the assumption that scores
at one point in time generalize across several observation conditions (e.g.,
occasions, raters). Reliability metrics such as the α coefficient,
alternate-form reliability, and test-retest reliability provide evidence of the
generalizability of observed scores. Extrapolation inferences consider how well
the observed score indicates performance in a larger domain, either
concurrently or in the future. An example of a validated extrapolation
inference is that CBM-R scores are indicative of the larger domain of reading
achievement (e.g., January & Ardoin, 2015; Reschly et al., 2009). Indeed, the
argument-based approach is well suited as a framework for validating the
interpretations and use of results from universal screening in schools (Christ
& Nelson, 2014).

In the case of universal screening in reading, the IUA is all of the inferences
and decisions that are made based on the resultant data. That is, it is assumed
that (a) scoring rules were applied accurately (scoring inference), (b)
observed scores generalize across observations (generalization inference), and
(c) scores from universal screening assessments are indicators of students’
reading achievement (extrapolation inference). On the basis of these
inferences, universal screening data are used to make decisions regarding
whether a student may benefit from more intensive instruction in reading.
Schools typically conduct universal screenings three times per year; thus,
inferences and decisions are made based on the resultant data from each
universal screening period. Therefore, it is important that the interpretations
and use of screening data are validated within the context of universal
screening.

THE CURRENT STUDY

The present study aimed to replicate and extend the existing universal
screening literature by examining procedures for evaluating early elementary
students’ achievement in reading. The current study extended this research by
using NWF probes that measured skills beyond the decoding of VC and CVC words
and requiring students to read the pseudowords that make up NWF probes as
units. By requiring students to read nonwords as units, NWF probes assess the
same skill for all students, as opposed to data reflecting some students’
letter-sound knowledge and other students’ blending skills. We also added to
the existing literature by evaluating students’ word-reading fluency on a WIF
probe consisting of solely decodable words, in addition to a probe consisting
of high-frequency words. This is in contrast to previous studies (e.g., Clemens
et al., 2011; Fuchs et al., 2004; Zumeta et al., 2012), in which a single WIF
probe consisted of both decodable and nondecodable high-frequency words. By
administering both a solely decodable word list and a high-frequency word list,
we explored whether the type of words used in WIF probes is meaningful in
predicting students’ reading achievement.

The current study used Kane’s (2013a, 2013b) argument-based framework to
evaluate the validity evidence for universal screening data in the early
elementary grades. Although evidence for the scoring and generalization
inferences is not a primary focus of this study, this information will be
presented in the Method section. Thus, the interpretation of interest is as
follows: NWF and WIF scores are good indicators of the larger domain of reading
achievement for first- and second-grade students (i.e., extrapolation
inferences). Previous research suggests a moderate to strong relation between
the subskills of decoding (as measured by NWF) and word-reading skills (as
measured by WIF) with students’ reading achievement, as measured by their
performance on CBM-R probes and norm-referenced measures. Therefore, the first
purpose of this study was to evaluate whether NWF and WIF scores are adequate
indicators of word analysis skills (i.e., decoding, phonological awareness) and
global reading achievement, as measured by CBM-R and a nationally
norm-referenced test that was administered concurrently. We also determined
which word list (WIF, NWF) was a better indicator of early reading skills and
reading achievement.

Kane’s (2013a, 2013b) argument-based approach to validation was also used to
evaluate the use of universal screening data to identify students who are at
risk. Extant research examining the classification accuracy of NWF and WIF
scores has suggested that WIF might be more accurate for identifying at-risk
readers in first grade (Clemens et al., 2011) and has questioned the utility of
administering CBM-R when screening early readers (Catts et al., 2009; Hosp et
al., 2011). However, because it is a general outcome measure, CBM-R may be most
appropriate for universal screening, instead of using subskill mastery measures
that assess the component skills of reading. Thus, the second purpose of the
current study was to evaluate the accuracy of the decisions that are made with
CBM-R, NWF, and WIF scores regarding whether first- and second-grade students
are at risk for reading difficulties. We were also interested in determining if
classification accuracy could be improved when either an NWF or WIF probe is
administered in conjunction with CBM-R. To address the second purpose of this
study, we evaluated the classification accuracy of each screening measure alone
and then with CBM-R to identify at-risk students, as measured by a concurrently
administered norm-referenced measure of global reading achievement.

METHOD 

Potential participants were initially recruited to be a part of a large study
validating CBM-R for universal screening and progress monitoring in Grades 1-5
(Pratt et al., 2011; White et al., 2011). For the purposes of the current
study, data were collected as part of two elementary schools’ routine
assessment procedures. Participating schools were part of two school districts
located within the Southeastern United States. There were 10 first-grade
classrooms (4 in School A, 6 in School B) and 9 second-grade classrooms (4 in
School A, 5 in School B) represented in this study. School-wide, 19% of
students in School A qualified for free or reduced-price meals and
approximately 71% of students in School B qualified for free or reduced-price
meals.

Participants 

All students (N = 287) who were present during the winter universal screening
window were recruited as participants. However, students who were English
learners (5 in first grade, 25 in second grade) were excluded from the analyses
for this study because of potential bias in using CBM-R scores for universal
screening (Hosp et al., 2011). Thus, all remaining participants (n = 257) were
native English speakers. There were 135 first-grade students (69 from School A,
66 from School B), who were primarily male (59.3%) and ranged in age from 6.41
to 8.31 years (M = 7.02 years, SD = 0.38 years). The racial and ethnic
composition of the first-grade students was 82.2% White, 6.7% African American,
4.4% Hispanic or Latino, 2.2% Asian, and 4.4% other or not specified.
Approximately 3.7% of first-grade students were eligible for special education
services. Just over half of the 122 second-grade students (60 from School A, 62
from School B) were male (54.1%); the second-grade students ranged in age from
6.85 to 9.27 years (M = 7.99 years, SD = 0.41 years). The racial and ethnic
composition of the second-grade students was 73.8% White, 8.2% Hispanic or
Latino, 7.4% African American, 4.1% Asian, and 6.6% other or not specified. Of
the second-grade students, 5.7% were eligible for special education services.

Measures 

All participants were administered two CBM-R probes; one decodable WIF probe
(WIF-D); one high-frequency WIF probe (WIF-HF); a whole-word NWF probe
(NWF-whole) that required blending words as units; and the Iowa Test of Basic
Skills (ITBS; Hoover, Dunbar, & Frisbie, 2001). The word lists used in this
study are similar to those developed and published by a screening and
progress-monitoring assessment system (Christ et al., 2014) with measures
demonstrating adequate reliability and validity. Unless otherwise noted, the
dependent measure for each universal screener was WRCM.

Decodable WIF 

The authors developed separate WIF-D probes for each grade, with each list
consisting of 304 phonetically regular words. In the development of both lists,
decodability guidelines set forth by Menon and Hiebert (1999) were employed,
which include CV words at Level 1; the words become increasingly difficult,
based on linguistic decoding patterns, with multisyllabic words at Level 8. The
first-grade WIF-D consisted of words that met the guidelines for decodability
Levels 1 through 5. For example, words on the first-grade WIF-D included pop
(Level 2), dent (Level 3), cape (Level 4), and breeze (Level 5). The
second-grade list included 159 words from Levels 1-5 and 145 words from Levels
6 (e.g., car), 7 (e.g., south), and 8 (e.g., problem). Evidence for the
generalization inference for the WIF-D probe is reflected in adequate internal
consistency (α = .98), test-retest reliability (r = .94), and alternate-form
reliability (r = .94; Christ et al., 2014).

High-Frequency WIF 

Separate first- and second-grade WIF-HF probes were also developed by the
authors, with each word list consisting of 304 high-frequency words that were
decodable and nondecodable. Words were selected from two commonly used
high-frequency word lists, the 315 Dolch Word List (Johns, 1971) and the New
Instant Word List (Fry, 1980). The WIF-HF probe for each grade included words
from both lists, but the second-grade list included a greater number of less
frequent words from the New Instant Word List. WIF-HF probes have adequate
alternate-form reliability (r = .94), internal consistency (α = .99), and
test-retest reliability (r = .97), providing evidence of the generalization
inference (Christ et al., 2014).

Whole-Word NWF 

Similar to the WIF-D probes, separate grade-level NWF-whole probes were
developed using the decodability levels outlined by Menon and Hiebert (1999).
NWF-whole probes consisting of 304 decodable pseudowords were developed for
each grade level. The first-grade probe included 304 pseudowords from Levels
1-5, and the second-grade probe included pseudowords from Levels 2-8. The
generalization inference for NWF-whole probes is supported by adequate
alternate-form reliability (r = .85), test-retest reliability (r = .76), and
internal consistency (α = .96; Christ et al., 2014).

Curriculum-Based Measurement in 
Reading 

First-grade students were administered two CBM-R probes: one
investigator-developed preprimer probe and one first-grade level CBM-R probe
selected from the easyCBM passage set (www.easycbm.com). The preprimer probe
developed by the authors included 88 unique words (258 total words), 57% of
which were high-frequency words. Second-grade students were administered the
first-grade level probe that was administered to the first-grade students and a
second-grade level probe from the easyCBM passage set. By administering a
passage that was below grade level, we were attempting to increase the
possibility that CBM-R scores could be used to distinguish among struggling
students. We hoped that an easier passage might result in greater differences
among those students who had difficulty reading their grade-level passage.
Furthermore, given the number of probes that were administered to participants,
we administered only one grade-level and one below grade-level CBM-R probe as
opposed to the three CBM-R probes that are traditionally administered as part
of universal screenings. Additionally, previous research suggests that
administering one CBM-R probe instead of three is appropriate for universal
screening purposes (Ardoin et al., 2004). The reliability and validity of
easyCBM passages are adequate (Jamgochian et al., 2010; Lai et al., 2010) and
are similar to those of other commonly used CBM probes. The average WRCM across
the two probes was used as the dependent measure.

Iowa Test of Basic Skills 

The ITBS is a group-administered and nationally norm-referenced assessment for
kindergarten through eighth-grade students (Hoover et al., 2001). Students were
administered either Form A, Level 7 (first grade) or Form A, Level 8 (second
grade). For the purposes of this study, the ITBS-Total Reading composite
(ITBS-TR), which estimates students’ vocabulary and reading comprehension
skills, and the ITBS-Word Analysis subtest (ITBS-WA), which assesses students’
phonological awareness, decoding, and understanding of word parts, were used.
The ITBS-WA was selected for the current study given that it measures students’
early reading skills. The ITBS-TR and ITBS-WA have adequate Kuder-Richardson
Formula 20 internal consistency in first grade (.93 and .85, respectively) and
second grade (.94 and .85, respectively; Hoover et al., 2001). The
content-related validity of the ITBS was established through an extensive
development process that included a curriculum review, preliminary item tryout,
national item tryout, fairness review, and development of individual tests
(Hoover et al., 2001).

The current study used ITBS Developmental Standard Scores (SSs) as the
dependent measure. Developmental SSs were created using 200 as the median score
for fourth-grade students and 250 as the median score for eighth-grade
students. Thus, students’ SSs indicate their performance along an achievement
continuum from kindergarten through Grade 8. In the standardization sample,
first-grade students’ ITBS-TR SSs averaged 151.3 (SD = 13.15) and ITBS-WA SSs
averaged 152.2 (SD = 18.4). For the second-grade students in the
standardization sample, ITBS-TR SSs averaged 170.0 (SD = 19.1) and ITBS-WA SSs
averaged 171.0 (SD = 23.7).

Procedures

Students were administered the ITBS during winter of the academic year by their
classroom teachers, who followed standardized administration procedures. Within
1 week of ITBS administration, examiners individually administered the WIF-D,
WIF-HF, NWF-whole, and CBM-R probes in random order, counterbalanced across all
participants during one session. For the CBM-R probes, standardized
administration and scoring procedures were followed: students were instructed
to read across the page and down, to do their best reading, and that if they
did not know a word, it would be told to them. Substitutions, skipped words,
misread words, and words that were not read within 3 s were counted as errors
and used to calculate WRCM. With the exception of students being told they
would be reading a list of words, the administration and scoring procedures
were identical for the WIF probes. NWF-whole administration procedures were
modified from typical NWF procedures. That is, students were told they would be
reading a list of pseudowords and were instructed to read the words as whole
words and not sound by sound. To ensure that students understood the
instructions, they were administered practice items and were provided with
corrective feedback prior to being administered the word list. Scoring
procedures were also modified, as only pseudowords read accurately as units
were scored as correct. WRCM for the NWF-whole task was calculated by
subtracting the total number of errors (i.e., words read sound-by-sound,
skipped words, misread words, and words that were not read within 3 s) from the
total number of words read.
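
The NWF-whole scoring rule can be illustrated with a minimal sketch. The function name and the example counts are hypothetical, and the scaling by administration time is an assumption added for generality; for a 1-min probe the result is simply words read minus errors.

```python
def words_read_correctly_per_minute(words_read, errors, seconds=60.0):
    """Return WRCM: (words read - errors), expressed as a per-minute rate.

    `seconds` is an assumed parameter; with the default of 60 the result
    equals the number of words read correctly.
    """
    if errors < 0 or errors > words_read:
        raise ValueError("errors must be between 0 and words_read")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    words_correct = words_read - errors
    return words_correct * 60.0 / seconds


# Hypothetical student: 38 pseudowords read in 1 min, 5 scored as errors
# (e.g., read sound-by-sound, skipped, misread, or not read within 3 s).
print(words_read_correctly_per_minute(38, 5))  # 33.0
```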

Procedural Integrity and Interscorer 
Agreement 

Examiners were school psychology graduate students and undergraduate research
assistants who participated in an hour-long training session led by the second
author. Examiners were trained until they were 100% reliable on three
consecutive probes. Prior to collecting data independently, examiners observed
the second author complete an administration, were observed as they conducted
an administration, and then were provided with feedback. If examiners completed
100% of the procedures accurately, they transitioned to collecting data
independently. Otherwise, on-site training procedures were repeated until
examiners accurately completed all required steps. All experimental sessions
were audio recorded, and recordings were used to calculate procedural integrity
and interscorer agreement of 15% of experimental sessions. Examiners adhered to
a procedural checklist, and procedural integrity was calculated by dividing the
number of correctly completed steps by the total number of steps (40),
multiplied by 100 to obtain a percentage. Across examiners, procedural
integrity averaged 98% (range = 83%-100%). Interscorer agreement was calculated
by dividing the number of agreements by the number of agreements plus
disagreements, multiplied by 100 to obtain a percentage. Interscorer agreement
averaged 99% for CBM-R (range = 91%-100%), 98% for WIF-HF (range = 91%-100%),
94% for WIF-D (range = 74%-100%), and 90% for NWF-whole (range = 67%-100%),
providing evidence of appropriate scoring inferences. Although the interscorer
agreement for the NWF-whole probes was lower than expected, there were only a
few outliers (i.e., four fell below 75%).
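
The two percentages described above follow directly from their definitions. The sketch below is illustrative only; the function names and the example counts are hypothetical and are not data from the study.

```python
def procedural_integrity(steps_completed, total_steps=40):
    """Percentage of checklist steps completed correctly (40 steps in this study)."""
    if not 0 <= steps_completed <= total_steps:
        raise ValueError("steps_completed must be between 0 and total_steps")
    return 100.0 * steps_completed / total_steps


def interscorer_agreement(agreements, disagreements):
    """Agreements divided by agreements plus disagreements, as a percentage."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("at least one scored item is required")
    return 100.0 * agreements / total


# Hypothetical session: 39 of 40 steps completed; two scorers agree on 45 of 50 words.
print(procedural_integrity(39))      # 97.5
print(interscorer_agreement(45, 5))  # 90.0
```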

Data Analyses 

Evidence for the extrapolation inferences was obtained by using Pearson
product-moment correlations to examine the concurrent relation between the
WIF-D, WIF-HF, and NWF-whole scores and students’ ITBS-WA, ITBS-TR, and CBM-R
performance. The magnitude of correlation coefficients was compared by use of
Cohen’s (1988) general guidelines, wherein point estimates <.29 are considered
small; .30 to .49, moderate; .50 to .69, large; and coefficients >.70, very
large. Then, the extent to which scores from WIF-D, WIF-HF, or NWF-whole were
better indicators of early reading skills (ITBS-WA) and global reading
achievement (ITBS-TR) was evaluated using guidelines for comparing correlation
coefficients that were delineated by Steiger (1980). That is, each correlation
coefficient was transformed into a z score, and statistical significance
between pairs of coefficients was then evaluated using equations detailed by
Steiger (1980) that accounted for the fact that correlations are dependent
(i.e., from the same sample) and have one variable in common (i.e., ITBS-WA or
ITBS-TR).
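
A comparison of two dependent correlations that share one variable can be sketched as follows. This is one common formulation of Steiger's (1980) z test, using Fisher-transformed coefficients and a pooled correlation in the covariance term; the text does not state which of Steiger's equations was applied, so this choice is an assumption. The function name and the example values are hypothetical.

```python
import math


def steiger_z(r_jk, r_jh, r_kh, n):
    """Compare dependent correlations r_jk and r_jh that share variable j.

    j is the common criterion (e.g., ITBS-TR); k and h are two screeners;
    r_kh is the correlation between the screeners; n is the sample size.
    Returns (z, two-tailed p value).
    """
    z_jk = math.atanh(r_jk)  # Fisher r-to-z transformation
    z_jh = math.atanh(r_jh)
    r_bar = (r_jk + r_jh) / 2.0  # pooled estimate used in the covariance term
    psi = (r_kh * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r_kh ** 2))
    cov = psi / (1 - r_bar ** 2) ** 2  # covariance of the two z scores
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * cov)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p value
    return z, p


# Hypothetical values: two screeners correlate .70 and .55 with the criterion
# and .60 with each other, in a sample of 135 students.
z, p = steiger_z(0.70, 0.55, 0.60, 135)
print(round(z, 2), round(p, 3))
```

A positive z indicates that the first screener has the stronger correlation with the shared criterion.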

To address the second purpose of this 
study, which was to evaluate the classification 
accuracy of each screening measure (WIF-D, 
WIF-HF, NWF-whole, CBM-R) and to deter¬ 
mine whether adding a subskill mastery mea¬ 
sure (WIF-D, WIF-HF, or NWF-whole) to 
CBM-R would improve classification accu¬ 
racy, students in each grade were classified as 
at risk or not at risk, based on their ITBS-TR 
scores. For these analyses, students with 
scores at or below the 25th percentile were 
classified as at risk and those scoring above 
the 25th percentile were classified as not at 
risk. Therefore, risk was used as a dichoto¬ 
mous variable. The 25th percentile was se¬ 
lected because it corresponds with below-av- 
erage performance on the ITBS. Next, several 
regression analyses were conducted with the 
screening measures predicting students’ risk 
status. First, each predictor was entered sepa¬ 
rately in series of logistic regressions. Then, in 
a series of sequential logistic regressions, 
CBM-R and each subskill mastery measure 
were entered together to determine the classi¬ 
fication accuracy gained by adding WIF-D, 
WIF-HF, or NWF-whole to CBM-R. For each 
logistic regression, the associated predicted 
probabilities were saved so that receiver operating
characteristic curve analyses could be conducted
to further evaluate the classification accuracy 
of the screeners to predict risk status. 
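
As an illustration of how saved predicted probabilities feed a receiver operating characteristic analysis, the sketch below computes the area under the curve as the proportion of (at-risk, not-at-risk) student pairs in which the at-risk student received the higher predicted probability, with ties counted as one half. The function name and the eight data points are hypothetical and are not study data; the statistical software used in the study is not named in the source.

```python
def roc_auc(probabilities, at_risk):
    """Area under the ROC curve from predicted probabilities.

    probabilities: predicted probability of being at risk for each student.
    at_risk: 1 if the student is at risk on the criterion, otherwise 0.
    """
    pos = [p for p, y in zip(probabilities, at_risk) if y == 1]
    neg = [p for p, y in zip(probabilities, at_risk) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one student in each risk group")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0   # at-risk student ranked above not-at-risk student
            elif p_pos == p_neg:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for eight students (not study data).
probs = [0.92, 0.80, 0.35, 0.60, 0.30, 0.20, 0.10, 0.05]
risk = [1, 1, 1, 0, 0, 0, 0, 0]
auc = roc_auc(probs, risk)
print(f"AUC = {auc:.3f}")
```

An AUC of .50 under this definition means the at-risk student is ranked higher only half the time, which is the chance level.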

Several statistics were used to evaluate 
the classification accuracy of the universal 
screening measures (see Christ & Nelson, 
2014, for a review). In the present study, sen¬ 
sitivity is the percentage of students deter¬ 
mined to be at risk on the ITBS-TR (i.e., 
scored at or below the 25th percentile) who 
were accurately classified by the screener as 
being at risk. Positive predictive value (PPV)
refers to the percentage of students predicted
to be at risk by the screener who were in fact
at risk on the ITBS-TR; a low PPV indicates
that the screener overidentifies students as at
risk. Specificity is the
percentage of students determined to be not at
risk on the ITBS-TR who were correctly classified
by the screener as not at risk. Negative
predictive value (NPV) is the percentage of
students predicted as not at risk by the screener
who were in fact not at risk on the ITBS-TR.
Researchers have suggested that
screening measures should be able to identify 
at least 90% of students at risk (Jenkins, Hud¬ 
son, & Johnson, 2007). Researchers also have 
suggested that a good screener should have at 
least 80% specificity (Compton et al., 2010). 
As such, sensitivity values were set as close to 
90% as possible and then the specificity, PPV, 
and NPV of each measure or combination of 
measures were obtained and compared. Fi¬ 
nally, the area under the curve (AUC) is a 
measure of the overall classification accuracy 
of the predictors, where .50 indicates a screener
(or set of screeners) has a classification accu¬ 
racy that is no greater than chance and 1.0 
represents perfect classification accuracy. It is 
generally accepted that AUC values of .90-1.0
are excellent and .85-.89 are good (Christ
& Nelson, 2014); however, screeners with 
AUC values <.85 are not recommended for 
making screening decisions (Center on Re¬ 
sponse to Intervention, 2015). 
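
The four classification statistics defined above can be computed directly from a two-by-two table of screener decisions against ITBS-TR risk status. The sketch below is illustrative only: the function name is hypothetical, and the cell counts are assumed values, chosen because they are consistent with the first-grade base rate (24 of 135 students at risk) and with the rounded CBM-R values in Table 2. The source does not report the cell counts themselves.

```python
def screening_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity, PPV, and NPV as percentages.

    tp: at risk on the criterion and flagged by the screener
    fn: at risk on the criterion but not flagged
    fp: not at risk on the criterion but flagged
    tn: not at risk on the criterion and not flagged
    """
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }

# Assumed counts for first-grade CBM-R (24 at risk, 111 not at risk, n = 135).
# These are reconstructed for illustration, not reported in the study.
metrics = screening_metrics(tp=22, fn=2, fp=7, tn=104)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

Under these assumed counts, sensitivity is about 92%, the closest attainable value to 90% with 24 at-risk students, and the remaining statistics round to the 94%, 76%, and 98% shown for CBM-R in Table 2.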

RESULTS 

Prior to analyses being conducted, it was 
determined that all variables were normally 
distributed. Descriptive statistics and correla¬ 
tions for all study variables are presented in 
Table 1. Chi-square analyses conducted to in¬ 
vestigate potential differences in student per¬ 
formance on each measure as a function of 
school attended revealed no significant differ¬ 
ences in first grade; however, second-grade 
students in School A had significantly higher
performance on ITBS-TR than second-grade 
students in School B (p < .05). No other 
significant differences in second-grade mea¬ 
sures were observed. Additionally, results of 
the Fisher’s exact test indicated no statistically 
significant differences across schools in the
percentages of students who were classified as
at risk or not at risk in first and second grade.

Table 1. Descriptive Statistics and Intercorrelations Among Study Variables

Variable                   M      SD   Range       1      2      3             4      5      6

First grade^a
1. Average CBM-R       63.39   40.47   2.5-156.5   —
2. WIF-HF              38.44   26.49   2-97        .94*   —
3. WIF-D               23.95   19.25   0-72        .90*   .94*   —
4. NWF-whole           17.05   15.61   0-70        .85*   .86*   .93*          —
5. ITBS-WA            153.84   18.43   124-202     .71*   .69*   [illegible]*  .63*   —
6. ITBS-TR            154.14   17.74   121-195     .89*   .83*   .81*          .77*   .77*   —

Second grade^b
1. Average CBM-R      100.31   39.63   7.5-202     —
2. WIF-HF              52.68   23.77   5-110       .85*   —
3. WIF-D               35.03   22.69   4-100       .86*   .91*   —
4. NWF-whole           20.21   16.02   0-76        .79*   .82*   .91*          —
5. ITBS-WA            167.94   23.05   121-233     .64*   .58*   [illegible]*  .57*   —
6. ITBS-TR            170.30   19.00   131-215     .81*   .69*   .71*          .64*   .72*   —

Note. CBM-R = curriculum-based measurement in reading; ITBS-TR = Iowa Test of Basic Skills-Total Reading
composite; ITBS-WA = Iowa Test of Basic Skills-Word Analysis subtest; NWF-whole = whole-word nonsense word
fluency; WIF-D = decodable word identification fluency; WIF-HF = high-frequency word identification fluency.
^a n = 135.
^b n = 122.
*p < .001.

Evidence for Extrapolation Inferences 

Results indicated that WIF-D, WIF-HF, 
and NWF-whole scores had statistically sig¬ 
nificant (p < .001) associations with ITBS- 
WA, ITBS-TR, and CBM-R performance, 
with coefficients being slightly larger in 
magnitude for first-grade students than for 
second-grade students. With ITBS-WA, co¬ 
efficients were large and ranged from .63 to 
.69 in first grade and from .56 to .58 in 
second grade. Associations between the 
word list scores and ITBS-TR performance 
were large to very large in magnitude, rang¬ 
ing from .77 to .83 in first grade and from 
.64 to .71 in second grade. A similar pattern 
was evident in the correlations between the 
word list scores and CBM-R performance in 
first grade (r = .85-.94) and second grade
(r = .79-.85). 

Although there were no significant dif¬ 
ferences in the associations between the word 
list and ITBS-WA scores in first and second
grade (p > .05), results of the statistical tests
indicated a few significant differences in co¬ 
efficients between the word list and ITBS-TR 
scores. In first grade, WIF-HF scores had a 
significantly greater association with ITBS-TR 
performance than did NWF-whole scores (p = .019)
and WIF-D scores had a significantly larger 
association with ITBS-TR performance than 
did NWF-whole scores (p = .037). However, 
there was not a significant difference be¬ 
tween WIF-D and WIF-HF scores in their 
relation to ITBS-TR performance (p > .05). 
For second-grade students, the association 
between WIF-D and ITBS-TR scores was 
significantly greater than the association be¬ 
tween NWF-whole and ITBS-TR perfor¬ 
mance (p = .012). No significant differ¬ 
ences were observed between WIF-HF and 
NWF-whole or WIF-HF and WIF-D in their 
relation to ITBS-TR performance (p > .05). 

Classification Accuracy of Universal 
Screeners 

Table 2. Classification Accuracy of Screening Measures to Predict ITBS-TR
Risk Status

                                                      Sensitivity ~ 90%
Screening Measure(s)   AUC    SE     95% CI           Specificity (%)   PPV (%)   NPV (%)

First grade^a
CBM-R                  .973   .013   [.948, .997]     94                76        98
WIF-HF                 .941   .022   [.898, .983]     88                63        98
WIF-D                  .940   .024   [.893, .987]     85                58        98
NWF-whole              .885   .034   [.818, .952]     72                40        98
CBM-R + WIF-HF         .974   .012   [.950, .997]     94                71        98
CBM-R + WIF-D          .976   .014   [.948, 1.000]    96                85        98
CBM-R + NWF-whole      .972   .014   [.944, 1.000]    96                81        98

Second grade^b
CBM-R                  .957   .027   [.905, 1.000]    87                59        98
WIF-HF                 .927   .036   [.857, .997]     73                41        98
WIF-D                  .968   .017   [.934, 1.000]    91                65        98
NWF-whole              .946   .023   [.901, .991]     86                55        98
CBM-R + WIF-HF         .956   .027   [.902, 1.000]    88                61        98
CBM-R + WIF-D          .965   .027   [.912, 1.000]    97                85        98
CBM-R + NWF-whole      .965   .028   [.910, 1.000]    99                94        98

Note. AUC = area under the curve; CBM-R = curriculum-based measurement in reading; ITBS-TR = Iowa Test of
Basic Skills-Total Reading composite; NPV = negative predictive value; NWF-whole = whole-word nonsense word
fluency; PPV = positive predictive value; WIF-D = decodable word identification fluency; WIF-HF = high-frequency
word identification fluency.
^a The at-risk base rate is 18% (n = 24).
^b The at-risk base rate is 17% (n = 21).

In first grade, 18% of students (n = 24)
scored at or below the 25th percentile on
ITBS-TR and were subsequently classified as
at risk. As indicated in Table 2, all the screeners’
individual classification accuracy was acceptable;
however, CBM-R had the greatest
AUC (.973), as compared with WIF-HF 
(.941), WIF-D (.940), and NWF (.885). The 
AUC for CBM-R + WIF-D (.976) was only
slightly greater than that for CBM-R alone, 
CBM-R + WIF-HF (AUC = .974), and 
CBM-R + NWF-whole (AUC = .972). With 
sensitivity values set near 90%, NPVs were all 
98% and CBM-R had the highest specificity 
and PPV (94% and 76%, respectively), 
followed by WIF-HF (88% and 63%, re¬ 
spectively), WIF-D (85% and 58%, respec¬ 
tively), and NWF (72% and 40%, respec¬ 
tively). When compared with CBM-R alone, 
the combination of CBM-R + WIF-D 
increased specificity by 2% and PPV by 9%
and CBM-R + NWF-whole resulted in a 2%
increase in specificity and a 5% increase in
PPV. However, adding WIF-HF to CBM-R 
made no difference in specificity and re¬ 
duced PPV by 5%. 

In second grade, 17% of students (n = 
21) were classified as at risk. When the overall 
classification accuracy of the measures in 
predicting second-grade students’ ITBS-TR 
risk status was compared (see Table 2), each 
screener was adequate, as WIF-D had the 
greatest AUC (.968) as compared with 
CBM-R (.957), NWF-whole (.946), and 
WIF-HF (.927). The AUC for CBM-R + 
WIF-D and CBM-R + NWF-whole was the 
same (.965), which was greater than the 
AUC for CBM-R alone and CBM-R + 
WIF-HF (AUC = .956). With sensitivity 
values set near 90%, WIF-D had the greatest 
specificity and PPV (91% and 65%, respectively),
followed by CBM-R (87% and
59%, respectively), NWF-whole (86% and
55%, respectively), and WIF-HF (73% and 
41%, respectively). Furthermore, adding 
NWF-whole to CBM-R resulted in the great¬ 
est increase in specificity (12%) and PPV 
(35%) compared with CBM-R alone, and 
CBM-R + WIF-D yielded a 10% increase in 
specificity and 26% increase in PPV. Con¬ 
versely, CBM-R + WIF-HF resulted in a 
1% increase in specificity and a 2% increase 
in PPV. 
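
Because sensitivity was held near 90% and the at-risk base rate was low, differences in specificity translate into much larger differences in PPV. The sketch below illustrates this relation with Bayes' rule. The function name is hypothetical, sensitivity is assumed to be exactly .90 for simplicity, and the first-grade base rate of 24 of 135 students is taken from Table 2, so the results only approximate the tabled values.

```python
def ppv_from_rates(sensitivity, specificity, base_rate):
    """Positive predictive value implied by sensitivity, specificity, and base rate."""
    true_pos = sensitivity * base_rate                 # proportion correctly flagged
    false_pos = (1 - specificity) * (1 - base_rate)    # proportion wrongly flagged
    return true_pos / (true_pos + false_pos)

base_rate = 24 / 135  # first-grade at-risk base rate from Table 2

# Specificity values from Table 2; sensitivity fixed at .90 as an assumption.
ppv_cbmr = ppv_from_rates(0.90, 0.94, base_rate)  # CBM-R
ppv_nwf = ppv_from_rates(0.90, 0.72, base_rate)   # NWF-whole
print(f"CBM-R PPV = {ppv_cbmr:.2f}, NWF-whole PPV = {ppv_nwf:.2f}")
```

Under these assumptions, a 22-point difference in specificity between CBM-R and NWF-whole corresponds to a PPV difference of roughly 35 points, close to the 76% and 40% reported in Table 2.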

DISCUSSION 

Schools often use CBM-R and NWF 
probes for universal screening in first grade 
and CBM-R exclusively in second grade, even 
though there might be benefits to administer¬ 
ing WIF probes in first grade (Clemens et al., 
2011; Fuchs et al., 2004) and second grade. 
Recent research in fact suggests that WIF 
scores explain variance in student achieve¬ 
ment beyond NWF scores and WIF was the 
single most accurate screening measure in first 
grade (Clemens et al., 2011). Such findings 
may be due to WIF probes assessing skills not 
measured by NWF probes, including students’ 
recognition of high-frequency words, their 
skills in decoding words that are more com¬ 
plex than CVC words, and their ability to 
decode words as units. In an attempt to ad¬ 
dress these issues, we used subskill mastery 
probes developed to measure students’ ad¬ 
vanced decoding skills (NWF-whole, WIF-D) 
and students’ reading of high-frequency words 
(WIF-HF) as well as CBM-R. Kane’s (2013a, 
2013b) argument-based approach to validation 
was used to evaluate the interpretations (i.e., 
extrapolation inferences) and use (i.e., deci¬ 
sions regarding at-risk status) of WIF-D, 
WIF-HF, NWF-whole, and CBM-R to iden¬ 
tify at-risk readers in first and second grade. 
First, we evaluated the extent to which 
WIF-D, WIF-HF, and NWF-whole scores 
were indicators of early reading skills, as 
measured by ITBS-WA, and global reading 
achievement, as measured by ITBS-TR (i.e., 
extrapolation inferences). Next, we evalu¬ 
ated the decisions made based on scores 
from universal screening measures by examining
the classification accuracy of each
screener and determined whether adminis¬ 
tering a WIF-D, WIF-HF, or NWF-whole 
probe with CBM-R would yield improve¬ 
ments in identifying students at risk for 
reading difficulties. 

Evidence for Extrapolation Inferences 

Our findings provide evidence for the 
extrapolation inferences that WIF-D, WIF-HF, 
and NWF-whole scores are good indicators of 
the larger domains of reading achievement and 
early reading skills. That is, as in previous 
research with first-grade students (Clemens 
et al., 2011; Cummings et al., 2011; Fien et al., 
2010), strong associations between WIF-D, 
WIF-HF, NWF-whole, and norm-referenced 
measures of early reading skills and global 
reading achievement were observed. We also 
extended those findings to second-grade stu¬ 
dents, with the relations among variables be¬ 
ing similar in magnitude to those observed in 
first grade. In first grade, the WIF-D and 
WIF-HF scores demonstrated a statistically 
larger association with ITBS-TR performance 
than the relation between NWF-whole and 
ITBS-TR, suggesting that the WIF measures 
were better indicators of global reading 
achievement than NWF-whole. However, in 
second grade, WIF-D had a significantly 
greater association with ITBS-TR than did 
NWF-whole, whereas WIF-HF performance 
was similar to NWF-whole and WIF-D in 
their relation to ITBS-TR. Moreover, a 
slightly larger relation between each 
screener and ITBS-WA (e.g., decoding, pho¬ 
nological awareness) and ITBS-TR was ob¬ 
served in first grade as compared with second 
grade. Results also extend prior research in 
that the word lists administered differed from 
those used in previous studies, which used 
WIF probes consisting of words that were not 
controlled for decodability. We administered a 
similar WIF probe (WIF-HF) but also admin¬ 
istered a structured WIF probe that consisted 
of only decodable words (WIF-D) to investi¬ 
gate potential differences in their association 
with reading achievement.

Classification Accuracy of Universal
Screeners

Findings from this study add to an ex¬ 
isting body of research supporting the use of 
CBM-R as a universal screening assessment 
(Kilgus et al., 2014). In first grade, scores from
CBM-R demonstrated the greatest overall 
classification accuracy as compared with each 
subskill mastery measure. Moreover, when 
sensitivity was examined at 90% and a speci¬ 
ficity guideline of 80% was used, either 
CBM-R, WIF-HF, or WIF-D was appropriate; 
however, CBM-R identified the greatest num¬ 
ber of first-grade students not at risk and overi¬ 
dentified the fewest number of students (i.e., 
false positives). Notably, NWF-whole, which 
is widely administered in first grade for uni¬ 
versal screening, had the lowest classification 
accuracy when sensitivity was set at 90%. 
Thus, although the NWF-whole probes devel¬ 
oped in the current study required unitization 
and measured a range of decoding skills, re¬ 
sults were consistent with prior research sug¬ 
gesting that NWF does not accurately discrim¬ 
inate among at-risk readers in first grade 
(Clemens et al., 2011; Johnson, Jenkins, Pet- 
scher, & Catts, 2009; Vanderwood et al., 
2008). On the basis of these results, it would 
appear that the subskill mastery measures (and 
particularly NWF-whole) have little utility 
when administered alone as universal screen¬ 
ers, as they are best at identifying students 
who are not at risk as opposed to accurately 
identifying those who are at risk. 

Results of this study suggest that during 
the winter universal screening period, CBM-R 
is the single most accurate screening measure 
for first-grade students at risk for reading dif¬ 
ficulties. This finding is consistent with previ¬ 
ous research suggesting that when CBM-R is 
administered in the fall of first grade, it 
classifies at-risk readers better than NWF 
(Johnson et al., 2009). Furthermore, im¬ 
provements in the classification accuracy of 
CBM-R by adding a subskill mastery mea¬ 
sure varied based on the measure. That is, 
although improvements were relatively 
small, adding WIF-D to CBM-R resulted in 
the greatest increase in specificity and PPV
(holding sensitivity at 90%) over CBM-R
alone. CBM-R + NWF-whole produced 
even smaller improvements in classification 
accuracy, and administering CBM-R + 
WIF-HF did not offer additional accuracy in 
classifying students at risk for reading 
difficulty. 

For second-grade students, findings 
from the present study support the use of 
scores from CBM-R and subskill mastery 
measures for classifying at-risk readers. 
WIF-D had the highest overall classification 
accuracy, and with sensitivity at 90%, WIF-D 
was most accurate at classifying students who 
were not at risk and overidentified fewer stu¬ 
dents than did CBM-R, WIF-HF, and NWF- 
whole. This finding is particularly interesting, 
given that WIF-D is not typically administered 
in second grade. However, when the classifi¬ 
cation accuracy of adding a subskill mastery 
measure to CBM-R was examined with sensi¬ 
tivity set at 90%, a slightly different pattern of 
findings was evident, when compared to the 
results of the statistical optimization. That is, 
although CBM-R + WIF-D produced a large 
increase in specificity and PPV, adding NWF- 
whole to CBM-R yielded the greatest increase 
in classification accuracy over CBM-R alone. 
It may be that NWF-whole better captured the 
range of students’ decoding skills and, there¬ 
fore, was an appropriate complement to 
CBM-R. 

Limitations and Future Research 
Directions 

Findings from this study should be in¬ 
terpreted with several limitations considered. 
First, NWF-whole administration procedures 
differed from those used in previous research, 
as well as from typical assessment practices, 
in that students were asked to say the pseudo¬ 
words as units without the option to provide 
the individual sounds in each word. We chose 
to use such procedures in an attempt to ensure 
that the same skill (i.e., blending of sounds) 
was being measured across participants, as 
previous research suggests giving students the 
option to provide the individual sounds or the 
entire word results in variability in the construct
measured (Ritchey, 2008). It is possible
that greater variability in lower achieving stu¬ 
dents’ NWF-whole scores would have been 
observed if students were able to choose their 
decoding strategy. A second limitation regard¬ 
ing our methodology is that the CBM-R score 
used in this study was averaged across two 
probes (one at grade level, one below grade 
level) instead of taking the median score 
from three grade-level probes. Third, al¬ 
though the word lists and preprimer CBM-R 
probe used in this study were developed 
based on empirical evidence, previous re¬ 
search has not demonstrated the validity of 
these measures. Given these limitations, fu¬ 
ture research should continue evaluating the 
validity of scores yielded from measures 
used in this study for universal screening. 

There are other limitations with our 
sample that may limit the generalizability of 
these findings to other populations. First, this 
study included students from a small sample 
of schools (i.e., two), and all measures were 
administered during the winter universal 
screening period. Thus, research investigating 
the validity of measures used in this study 
during other screening periods and in a larger 
sample of schools is warranted. Furthermore, 
previous research (e.g., Hosp et al., 2011) in¬ 
dicated that there may be potential bias in the 
decisions made with CBM scores based on, 
among other factors, the socioeconomic sta¬ 
tus (SES) or the race and ethnicity of stu¬ 
dents. In our sample, schools differed based 
on the percentage of students who received 
free and reduced-price meals, which is often 
a proxy for SES. However, given that we did 
not have individual SES data, we were not 
able to make comparisons based on students 
who received free and reduced-price meals 
versus those who did not, nor were we able 
to control for SES in our analyses. Further¬ 
more, the racial and ethnic composition of 
our sample, although reflective of the area in 
which participants were recruited, lacked di¬ 
versity. Therefore, future research should 
investigate whether findings differ based on 
students’ SES or racial and ethnic back¬ 
ground. 


Implications for Practice 

Results from the present study have im¬ 
portant implications for the practice of univer¬ 
sal screening in first and second grade to iden¬ 
tify students at risk for reading disabilities. 
First, the findings support the use of CBM-R 
scores for universal screening in first grade to 
identify students who are underachieving in 
reading. Furthermore, the subskill mastery 
measures failed to accurately classify first- 
grade students who were at risk, bringing into 
question the necessity of administering WIF 
probes or NWF probes that require unitization 
if CBM-R screening data are available. Nota¬ 
bly, despite publishers’ recommendations that 
NWF should be administered during first 
grade, findings indicate that NWF-whole 
should not replace CBM-R nor should NWF- 
whole be administered with CBM-R to iden¬ 
tify at-risk students, at least during the winter 
screening period. In second grade, the findings 
were less clear but suggest that administering 
either CBM-R or WIF-D for universal screen¬ 
ing may be appropriate. Furthermore, if a 
school is interested in adding a subskill mas¬ 
tery measure to CBM-R for universal screen¬ 
ing in either first or second grade, findings 
suggest that adding WIF-D in first grade or 
NWF-whole in second grade may provide the 
most accurate identification of students who 
are underachieving in reading. However, in 
first grade, differences between the classifica¬ 
tion accuracy of CBM-R alone and word list 
measures added to CBM-R were minimal (i.e., 
one to two additional students classified as at 
risk). Similarly, administering NWF-whole 
with CBM-R in second grade yielded approx¬ 
imately seven more students identified as be¬ 
ing at risk. Therefore, educators must decide 
whether it is worth the time and resources to 
increase their screening efforts in order to 
have small improvements in classification 
accuracy. 

CONCLUSIONS 

The purpose of this study was to use 
Kane’s (2013a, 2013b) argument-based approach
to validation as a framework to evaluate
the interpretations and use of universal
screeners in first and second grade. Specifically,
we demonstrated that scores from the
WIF-D, WIF-HF, and NWF-whole measures 
in this study adequately indicated performance 
in the larger domains of global reading 
achievement and early reading skill (extrapo¬ 
lation inferences). This study also evaluated 
how universal screening data are used for 
making decisions about students’ risk status, 
focusing on whether administering word lists 
to emerging readers during universal screen¬ 
ings could either improve or supplant existing 
universal screening practices. The results of 
this study confirmed, once again, that CBM-R 
is a valid, strong estimate of students’ global 
reading achievement and that CBM-R can 
classify at-risk readers in first and second 
grade accurately. Although findings indicated 
that including WIF-D (first grade) or NWF- 
whole (second grade) as supplements to 
CBM-R may provide small increases in the 
number of students identified as at risk, 
spending the additional time and resources 
required to screen all students may not be 
practical. It is also important to note that 
NWF and WIF probes are subskill mastery 
measures, which—by design—are not in¬ 
tended to be indicators of global reading 
achievement. Furthermore, if the purpose of 
universal screening within a Multi-Tiered Sys¬ 
tem of Support framework is to identify stu¬ 
dents who may be at risk for learning disabil¬ 
ity in reading, using a general outcome mea¬ 
sure (such as CBM-R) seems most 
appropriate. By using CBM-R to screen for 
at-risk readers, educators can quickly identify 
that a problem exists before following up with 
additional assessment to determine the underlying
skill deficit causing reading difficulties.